Bimodal Content Defined Chunking for Backup Streams
Authors
Abstract
Data deduplication has become a popular technology for reducing the amount of storage space required for backup and archival data. Content defined chunking (CDC) techniques are well-established methods of separating a data stream into variable-size chunks such that duplicate content has a good chance of being discovered irrespective of its position in the data stream. Requirements for CDC include fast and scalable operation, as well as good duplicate elimination. While the latter can be achieved by using chunks of small average size, this also increases the amount of metadata necessary to store the relatively more numerous chunks, and negatively impacts the system's performance. We propose a new approach that achieves comparable duplicate elimination while using chunks of larger average size. It involves using two chunk size targets, and mechanisms that dynamically switch between the two based on querying data already stored: we use small chunks in limited regions of transition from duplicate to non-duplicate data, and large chunks elsewhere. The algorithms rely on the block store's ability to quickly deliver a high-quality reply to existence queries for already-stored blocks. A chunking decision is made with limited lookahead and a limited number of queries. We present results of running these algorithms on actual backup data, as well as four sets of source code archives. Our algorithms typically achieve duplicate elimination similar to that of standard algorithms while using chunks 2–4 times as large. Such approaches may be particularly interesting to distributed storage systems that use redundancy techniques (such as error-correcting codes) requiring multiple chunk fragments, for which the metadata overhead per stored chunk is high. We find that algorithm variants with more flexibility in the location and size of chunks yield better duplicate elimination, at the cost of a higher number of existence queries.
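To make the core idea concrete, the sketch below re-chunks only at duplicate/non-duplicate transitions. It is a minimal sketch under stated assumptions, not the paper's implementation: `boundaries` stands in for any content-defined chunker (here a simple shift-based rolling hash), `exists` is assumed to be the block store's existence query, and the paper's actual algorithms make the switching decision with limited lookahead rather than querying every large chunk up front.

```python
def boundaries(data, avg_bits):
    """Content-defined cut points with roughly 2**avg_bits average chunk
    size, using a simple shift-based rolling hash (illustrative only)."""
    cuts, h = [0], 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        if (h & ((1 << avg_bits) - 1)) == 0:
            cuts.append(i + 1)
    if cuts[-1] != len(data):
        cuts.append(len(data))
    return cuts

def bimodal_chunks(data, exists, small_bits=13, large_bits=16):
    """Bimodal chunking sketch: keep large chunks inside runs of duplicate
    or entirely new data; re-chunk at the small target only where the
    stream transitions between duplicate and non-duplicate content."""
    cuts = boundaries(data, large_bits)
    large = [data[a:b] for a, b in zip(cuts, cuts[1:])]
    dup = [exists(c) for c in large]              # one existence query each
    out = []
    for i, chunk in enumerate(large):
        neighbour_dup = (i > 0 and dup[i - 1]) or \
                        (i + 1 < len(dup) and dup[i + 1])
        if not dup[i] and neighbour_dup:
            # Transition region: small chunks recover duplicates that a
            # large chunk straddling old and new data would miss.
            sub = boundaries(chunk, small_bits)
            out.extend(chunk[a:b] for a, b in zip(sub, sub[1:]))
        else:
            out.append(chunk)
    return out
```

Confining the small-chunk target to transition regions is what lets the average chunk size stay large while still catching the boundary between previously stored and new data.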
Similar Papers
An Efficient Data Deduplication based on Tar-format Awareness in Backup Applications
Disk-based backup storage systems are widely used, and data deduplication is becoming an essential technique in such systems because of its space efficiency. Typically, several user files are aggregated into a single Tar file at primary storage, and the Tar file is periodically transferred and stored to the backup storage system (e.g., as a weekly full backup) [1]. In this paper, we ...
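The snippet above is truncated, but the general idea of tar-format awareness can be illustrated with Python's standard `tarfile` module: if the deduplicator understands tar member boundaries, an unchanged file inside next week's full backup produces the same fingerprints regardless of how far it has shifted inside the archive. The helper below is a hypothetical illustration of that alignment, not the paper's actual method.

```python
import hashlib
import tarfile

def member_digests(tar_path):
    """Hash each archived file separately, so unchanged members between
    weekly full backups deduplicate even when their byte offsets in the
    tar stream have shifted. Illustrative helper, not the paper's API."""
    digests = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests
```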
A Relational Approach to Querying Streams
Data streams are long, relatively unstructured sequences of characters that contain information such as electronic mail or a tape backup of various documents and reports created in an office. This paper deals with a conceptual framework, using relational algebra and relational databases, within which data streams may be queried. As information is extracted from the data stream, it is put into a...
Leap-based Content Defined Chunking - Theory and Implementation
Content Defined Chunking (CDC) is an important component of data deduplication, affecting both the deduplication ratio and deduplication performance. The sliding-window-based CDC algorithm and its variants have been the most popular CDC algorithms for the last 15 years. However, their performance is limited in certain application scenarios since they have to slide byte by byte. The a...
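For reference, the byte-by-byte baseline the snippet criticizes can be sketched as follows. A Gear-style hash is used here for brevity (classic implementations use Rabin fingerprints over a fixed window, but the cut rule and the byte-at-a-time scan are the same); all parameters are illustrative.

```python
import random

_rng = random.Random(1)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # fixed random byte table

def cdc_chunks(data, mask_bits=13, min_size=2048, max_size=65536):
    """Byte-by-byte CDC: advance a rolling hash one byte at a time and cut
    whenever its low bits are zero (~8 KiB average before the clamps).
    The 32-bit state implicitly covers the last 32 bytes, since older
    bytes shift out of the state."""
    chunks, start, h = [], 0, 0
    mask = (1 << mask_bits) - 1
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        size = i + 1 - start
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```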
The assignment of chunk size according to the target data characteristics in deduplication backup system
This paper focuses on the trade-off between the deduplication rate and the processing penalty in a backup system that uses a conventional variable-size chunking method. The trade-off is a nonlinear negative correlation when the chunk size is fixed. To analyze the trade-off quantitatively across all factors, a simulation approach is taken, clarifying the several correlations among chunk siz...
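A back-of-the-envelope model makes the chunk-size trade-off concrete. The dedup ratios, chunk sizes, and per-chunk metadata costs below are assumptions for illustration, not measurements from the paper:

```python
def effective_storage(logical_bytes, dedup_ratio, avg_chunk, meta_per_chunk=64):
    """Stored bytes = unique data after deduplication + index metadata.
    Smaller chunks tend to raise dedup_ratio but inflate the metadata term."""
    return logical_bytes / dedup_ratio + (logical_bytes / avg_chunk) * meta_per_chunk

# E.g., for 1 TB of logical data, halving the chunk size might lift the
# dedup ratio from 5x to 6x while doubling the number of chunks to index:
coarse = effective_storage(1e12, 5.0, 16 * 1024)   # ~203.9 GB
fine = effective_storage(1e12, 6.0, 8 * 1024)      # ~174.5 GB

# With high per-chunk metadata (e.g., a distributed store keeping several
# fragments per chunk), the ordering reverses and larger chunks win:
coarse_hi = effective_storage(1e12, 5.0, 16 * 1024, meta_per_chunk=1024)  # ~262.5 GB
fine_hi = effective_storage(1e12, 6.0, 8 * 1024, meta_per_chunk=1024)     # ~291.7 GB
```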
Survey of Research on Chunking Techniques
The explosive growth of data produced by different devices and applications has contributed to the abundance of big data. To process such amounts of data efficiently, strategies such as de-duplication have been employed. Among the three levels of de-duplication, namely file level, block level, and chunk level, de-duplication at chunk level, also known as byte level, is the most popular a...